Local and Overall Deviance R-Squared Measures for Mixtures of Generalized Linear Models

In generalized linear models (GLMs), measures of lack of fit are typically defined as the deviance between two nested models, and a deviance-based R2 is commonly used to evaluate the fit. In this paper, we extend deviance measures to mixtures of GLMs, whose parameters are estimated by maximum likelihood (ML) via the EM algorithm. Such measures are defined both locally, i.e., at cluster-level, and globally, i.e., with reference to the whole sample. At the cluster-level, we propose a normalized two-term decomposition of the local deviance into explained, and unexplained local deviances. At the sample-level, we introduce an additive normalized decomposition of the total deviance into three terms, where each evaluates a different aspect of the fitted model: (1) the cluster separation on the dependent variable, (2) the proportion of the total deviance explained by the fitted model, and (3) the proportion of the total deviance which remains unexplained. We use both local and global decompositions to define, respectively, local and overall deviance R2 measures for mixtures of GLMs, which we illustrate—for Gaussian, Poisson and binomial responses—by means of a simulation study. The proposed fit measures are then used to assess, and interpret clusters of COVID-19 spread in Italy in two time points.


Introduction
In the framework of the classical linear model, the coefficient of determination, also known as R-squared (R 2 ), is a widely used goodness of fit measure, whose advantages and limitations are well-known in literature (see, e.g., Cameron and Windmeijer (1997) and Cohen et al. (2013)). For generalized linear models (GLMs), measures of lack of fit are typically defined based on the deviance, which compares the log-likelihoods of two nested GLMs. In this context, the deviance-based R 2 is extensively used to assess goodness of fit. In this paper, we focus on mixtures of GLMs (see, e.g., Kamakura, 2000, Chapter 7 andFrühwirth-Schnatter, 2006 Chapter 8), whose parameters are estimated by ML. These models represent a classical generalization of a single GLM, designed to handle data that are clustered around (generalized) linear structures. We introduce different deviance measures considering both cluster, and sample levels.
At cluster-level, where the clusters are identified by the a posteriori soft partition provided by the fitted model (defined as "soft" since each unit's memberships to the clusters are in the form of probabilities), we define a local deviance. The latter compares, for each cluster, the intercept-only GLM with the saturated GLM. We show that the newly defined local deviance can be decomposed into an explained local deviance, and a remained part, which is left unexplained by the local GLM.
At sample-level, we first define the total deviance by comparing the intercept-only GLM mixture with the saturated GLM mixture. Subsequently, we propose to decompose it into a normalized summation of three terms. Each additive term evaluates a different aspect of the fitted model: cluster separation on the dependent variable, the proportion of total deviance which is explained by the fitted model, the proportion of the total deviance which remains unexplained. We leverage on these new decompositions to define a local, and an overall deviance R 2 measure.
The paper is organized as follows. In Section 2, we set up the baseline framework of GLMs, devoting special emphasis to the concepts of deviance and R 2 . In Section 3, we introduce mixtures of GLMs and propose our local and overall deviance measures for this class of models. In Section 4, we derive the normalized decompositions of the local and total deviances, and conclude the section by introducing our local and overall deviance R 2 s. The proposal is first illustrated by means of three simulation studies in Section 6-for conditional Gaussian, Poisson, and binomial response types, respectively. The simulation for the Gaussian case includes an evaluation of the impact of choosing among the most popular initialization strategies. In Section 7, our fit measures are used to assess, and interpret clusters of COVID-19 prevalence in Italy in two points in time. We conclude the paper (Section 8) with a final discussion, and possible venues for future work.

Preliminaries About Generalized Linear Models
Let (x i , y i ) , i = 1, . . . , n, be independent observations from (X , Y ) , where Y is the dependent (or response) variable and X is a vector of d explanatory variables-which we call covariates, regressors, or predictors interchangeably. In GLMs, we assume that the conditional distribution of Y i given X i = x i is a member of the exponential family with probability density (or mass) function where θ i = θ(x i ) is the canonical parameter, which is a function of x i , φ (if unknown) is a dispersion parameter, and a(·), b(·), and c(·) are known functions varying from one member of the family to another. If φ is known, the model (1) is a member of the (one-parameter) exponential family; if φ is unknown, the model (1) may, or may not be a member of the twoparameter exponential family. The function b(·) describes the relationship between the mean of Y i , denoted by μ i , and the canonical parameter θ i , given by μ i = b (θ i ), where prime denotes differentiation with respect to θ i (see, e.g., McCullagh & Nelder, 1989, pp. 28-29).
In GLMs, a monotone and differentiable link function g (·) is exploited to relate μ i to the covariates x i through the relation where η(x i ; β) = β x * i is the so-called linear predictor, with β ∈ R d+1 and x * i = (1, x i ) to include the intercept term. Note that (2) can be inverted to obtain μ i = g −1 (η i ). As far as the choice of g(·) is concerned, more specialized results can be obtained by choosing the canonical link function such that θ i = η i , and the density in (1) can be easily expressed as a function of μ i , rather than using the canonical parameter θ i , as Table 1 specifies all the quantities defined so far for three well-known members of the exponential family: the Gaussian, Poisson, and binomial distributions.

Maximum Likelihood Estimation
For GLMs, the estimates of the regression coefficients β, and of the dispersion parameter φ (in the case of a two-parameter exponential family distribution), are typically obtained by the ML method. The log-likelihood function for a GLM, with density in Eq. 3, is given by where μ is an n × 1 vector, with ith entry equal to μ i . We denote by β, and φ the ML estimates of β and φ, respectively, and by μ i = μ x i ; β the fitted value of Y i .

Lack of Fit
In GLMs, the deviances replace the sums of squares (SS) of ordinary least squares (OLS) regression as the building blocks to define measures of lack of fit to the data of the GLM (see, e.g., Cohen et al., 2013). Notably, deviance measures are built from the maximum log-likelihoods of three models: the most parsimonious intercept-only model (null model), the model we are interested in (fitted model), and the least parsimonious model, with n parameters, providing a perfect fit (saturated model). The null and saturated models are defined so that μ i =ȳ, and μ i = y i (i = 1, . . . , n), respectively, withȳ being the sample mean of Y .
Each deviance is a measure of lack of fit, and is obtained as twice the difference between the log-likelihood of one model, compared to the log-likelihood of another (nested) model. Therefore, the larger the value of deviance for the nested model, the worse its goodness of fit. The two most used deviances are the null deviance and the fitted deviance where y, μ, andȳ are the n × 1 vectors with ith entry being y i , μ i , andȳ, respectively, and φ is the ML estimate of φ under the fitted model. As well-motivated by Cameron and Windmeijer (1996), the same estimate of φ is used for all the models (null, fitted and saturated), as we wish to measure the fit due to the regressors, and not to the dispersion parameter.
The null deviance in Eq. 5 is analogous to the total sum of squares (TSS), that is, the total variation in the dependent variable Y from the OLS regression. This measures the discrepancy between the worst, and the best possible models, i.e., all the discrepancy that the (fitted) model can potentially account for.
The fitted deviance in Eq. 6 is analogous to the residual sum of squares (RSS) from OLS regression. This deviance measures the lack of fit after modeling with d predictors. Even if it is not as widespread in the literature, in principle we could also define a sort of "explained" deviance as which compares the null and fitted models, analogously to the explained sum of squares (ESS) from OLS regression. Explained and residual deviances allow us to decompose the null deviance as

Deviance R 2
In OLS regression, the R 2 is a universal and agreed-upon index of model goodness of fit measuring the proportion of total variation in Y accounted for by a set of d predictors. No single agreed-upon index of goodness of fit exists for GLMs, although several approaches have been proposed. These pseudo-R 2 s are based on different definitions of residuals, the most common of which are the raw residuals, the Pearson residuals, and the deviance residuals (Cameron & Windmeijer, 1996). Note that none of these indexes is a goodness of fit measure, in the sense that none can be interpreted as "proportion of variance accounted for," as in the OLS regression. For GLMs, the deviance R 2 is one of the favorite indexes of goodness of fit by applied and theoretical researchers Windmeijer, 1997 andMenard, 2002); it is based on the deviance residuals, and it is sometimes the only measure discussed in articles/textbooks (Guisan & Zimmermann, 2000) and implemented in statistical software (Crawley, 2012, Chapter 13). Intuitively, it looks similar to the R 2 = ESS/TSS of simple linear regression, where the sums of squares are replaced with the deviance measures defined in (5) and (7). Its formula is given by Due to the two-term decomposition (8), also this index ranges between zero and one.

Mixtures of GLMs and Their Complete-Data Log-Likelihood
Suppose that the conditional expectation of Y i given X i = x i varies across the k levels (groups, clusters or classes), labeled as 1, . . . , k, of a categorical (nominal) latent variable J i . Under this assumption, mixtures of GLMs constitute a reference framework of analysis (see, e.g., (McLachlan & Peel, 2000, Chapter 5) and (Wedel & Kamakura, 2000, Chapter 7)). The density of a generic observation y i , according to a mixture of k GLMs, can be written as where f (y i ; μ ij , φ j ) is the j th component density of y i defined like in Eq. 3, with mean μ ij = E (Y i |X i = x i , J i = j ) and dispersion parameter φ j (in the case of a two-parameter exponential family), and π j = P (J i = j), with π j > 0 and k j =1 π j = 1. In Eq. 11, ψ denotes the set of all parameters of the model, where π 1 , . . . , π k−1 are the weights, β 1 , . . . , β k are the local regression coefficients, and, in the case of a two-parameter exponential family distribution for Y i |X i = x i , J i = j , φ 1 , . . . , φ k are the local dispersion parameters. General conditions for the identifiability of model Eq. 11 are given in Grün and Leisch (2008a) (see also Frühwirth-Schnatter (2006)).
ML estimates of the parameters are usually obtained via the expectation-maximization (EM) algorithm (Dempster et al., 1977). The core quantity of the algorithm is the completedata log-likelihood where, based on Eq. 4, with z ij = 1 if (x i , y i ) comes from the j th component, and z ij = 0 otherwise. The EM algorithm iterates between the expectation-step (E-step) and the maximizationstep (M-step) until convergence. At the generic iteration, in the E-step, given the current estimate of ψ from the previous iteration, sayψ, each z ij in Eq. 12 is replaced by the corresponding a posteriori probability of group membership (expectation of Z ij ), sayz ij . This leads to the expected complete-data log-likelihood function. In the M-step, this function is maximized with respect to ψ; as the two terms on the right-hand side of Eq. 12 have zero cross-derivatives, they can be maximized separately.
The maximization of the expected complete-data log-likelihood function is equivalent to the maximization problem discussed in Section 2.1 (for the complete data), except that each observation (x i , y i ) contributes to the log-likelihood with a known weightz ij , and Mazza et al., 2018. See, e.g., (Wedel & De Sarbo, 1995), and (Wedel & Kamakura, 2000, pp. 120-124) for a complete review of the EM algorithm for fitting the model Eq. 11.
Once the model Eq. 11 is fitted, each observation (x i , y i ) is classified into one of the k categories according to the maximum a posteriori probability (MAP) estimate: MAP( z ij ) = 1 if max h { z ih } occurs in cluster j (j = 1, . . . , k), and 0 otherwise, where z ij denotes the output value ofz ij at convergence of the EM algorithm.

Deviance Measures
Based on the arguments given earlier, hereafter we introduce the deviance measures of lack of fit for the mixture model Eq. 11 both locally (for each group j defined by the soft partition z ij , i = 1, . . . , n), and overall (for the whole sample).
Let us introduce what we call the local null model, i.e., the local intercept-only model. In notation terms, such a model assumes μ ij =ȳ j , whereȳ j = n i=1 z ij y i / n j . The quantity n j = n i=1 z ij denotes the expected (soft) size of the j th group according to the fitted model, j = 1, . . . , k. The local null model can be considered as in between the null and the fitted models. The wording "soft" is used because the group memberships z ij , i = 1, . . . , n and j = 1, . . . , k, are a posteriori probabilities-and not "hard" 0/1 values.
For each soft group j , j = 1, . . . , k, we define the local null deviance (or simply local deviance) the local fitted deviance (or local residual deviance) (15) and the local explained deviance where μ j andȳ j are n × 1 vectors with ith entry being μ ij , andȳ j , respectively.
In (14)-(16), since the focus is on measuring the fit due to the regressors, all considered models (null, fitted, and saturated) are evaluated at the soft partition z ij (i = 1, . . . , n and j = 1, . . . , k), and at the ML estimate φ j of φ j under the fitted model (refer to Section 2.2 and to (Cameron & Windmeijer, 1996)). As a consequence, the estimate of the weight π j is the same for all models, and this is the reason why it vanishes (by simplification) from Eqs. 14-16. In analogy with Eq. 8, it is easy to realize that for j = 1, . . . , k.
For the full sample, we define the null deviance (or total deviance) the (soft) within-group deviance (or simply within deviance) and the (soft) between-group deviance (or simply between deviance) where The between deviance in Eq. 20 measures the discrepancy between the intercept-only model, and the local intercept-only model. In terms of clustering validation: 1. BD can be seen as a separation measure (Cerdeira et al., 2012) see, e.g., indicating how well-separated clusters (represented byȳ 1 , . . . ,ȳ k ) are along the y-axis (the greater the value of BD, the more "separated" the clusters are along Y ); 2. WD measures the discrepancy between the best possible saturated model, and the local intercept-only model. WD can be seen as a compactness measure (see, e.g., Panagiotakis, 2015), quantifying how close observations in a cluster are with respect to the average response of that cluster (the smaller the value of WD, the more "compact" the clusters are around their average response).
From Eqs. 19 and 20, we obtain the two-term decomposition of the total deviance as The within deviance WD can be further decomposed as is the explained within deviance, and is the residual within deviance. Two important remarks follow.
1. The explained within deviance EWD measures the discrepancy between the fitted model and the local intercept-only model; in particular, based on Eq. 23, EWD can be considered as the part of WD explained by the local models involving the covariates. 2. The residual within deviance RWD measures the discrepancy between the best possible saturated model, and the fitted model; specifically, based on Eq. 23, RWD can be considered as the part of WD which we are not able to predict locally based on the covariates.
Finally, substituting (23) in (22), we obtain the final three-term decomposition of the total deviance as As a special case, when k = 1, the BD term in Eq. 26 vanishes and TD = EWD + RWD, which is the null deviance decomposition Eq. 8 for the GLM.

Some Special Cases
For illustrative purposes, we compute the local deviance measures, introduced in Section 3.2, in the case of three well-known exponential family distributions for : Gaussian (Section 3.3.1), Poisson (Section 3.3.2), and binomial (Section 3.3.3).

Gaussian Case
The Gaussian distribution is the only two-parameter exponential family distribution we consider herein. This choice for the response variable in model Eq. 11 leads to mixtures of linear Gaussian regressions. For these mixtures, we recall that measures of lack of fit, based on sums of squares, have been already introduced in (Ingrassia & Punzo, 2020).
Using the notation in Table 1, we are assuming that Simple algebra allows to simplify the local deviances in Eqs. 14, 15, 16, and Eq. 21, yielding These deviance-based lack of fit measures differ from those in (Ingrassia & Punzo, 2020) as now the component error variances σ 2 j enter the decomposition. In terms of residuals, this means that, while the soft raw residuals are considered in (Ingrassia & Punzo, 2020), here the soft Pearson (standardized) residuals (which in this case coincide with the deviance residuals) are considered. This is a favorable improvement allowing us to compare the fit between clusters with different local conditional variances σ 2 1 , . . . , σ 2 k .

Poisson Case
Using the notation in Table 1, here we assume that Simple algebra leads to simplify the local deviances in Eqs. 14, 15, 16, and 21, so to obtain

Binomial Case
Using the notation in Table 1, here we assume that Simple algebra leads to simplify the local deviances in Eq. 14, 15, 16, and 21, yielding

Normalized Three-Term Decomposition of the Total Deviance
Starting from the three-term decomposition of the total deviance in Eq. 26, it is possible to define normalized deviance measures which evaluate the main aspects of the fitted model. In particular, dividing both sides of Eq. 26 by TD, we obtain where NBD, NEWD, and NRWD are the normalized versions of BD, EWD, and RWD, respectively.
In terms of interpretation, NBD is the proportion of the total deviance explained by the separation measure BD; hence, NBD can seen as a sort of correlation ratio measuring the association between the dependent variable Y , and the latent group variable J . NEWD is the proportion of the total deviance explained by the inclusion of the covariates X-through the slope(s) of the local regressions. Instead, NRWD represents the proportion of the total deviance, which remains unexplained by the fitted model.

Normalized Explained Deviance
Exploiting Eq. 27, it is natural to introduce the quantity NED represents the proportion of the total deviance explained by the fitted model, desirably assuming values in the interval [0, 1]. The larger its value (hence, the smaller NRWD), the better the fit of the mixture of GLMs to the observed data. Provided that TD > 0, the limit cases NED = 0 and NED = 1 are obtained when NBD = NEWD = 0 and NRWD = 0, respectively. Cases where either of the three terms NBD, NEWD, and NRWD are null, are analyzed below.

Local and Overall Deviance R 2 Measures
Leveraging on Eq. 10, it is also natural to define the local deviance R 2 for the j th group as R 2 j can be seen as the proportion of the local deviance in the j th group that cannot be explained by the intercept-only GLM in that group, but which can by the linear predictor η ij = β j x * i of the GLM. As a general note, the higher the R 2 j , the better the j th GLM fits the data in the j th group. In other words, the larger the fraction of local deviance in group j that is accounted for by the j th GLM, the closer the data points are to the fitted cluster's regression line.
With the same principle, it is natural to define the overall deviance R 2 as Intuitively, the overall R 2 in Eq. 30 can be interpreted as the proportion of the within deviance explained (accounted for) by the fitted mixture of GLMs. Based on Eq. 24, R 2 is related to R 2 1 , . . . , R 2 k by the following relation According to Eq. 31, R 2 can be seen as a weighted average of R 2 1 , . . . , R 2 k , with normalized weights D 1 /WD, . . . , D k /WD being the proportions of the within deviance due to each local deviance.
All the deviance measures introduced so far are summarized in Table 2, which provides the expressions, and a link between them, along with a short textual description.

Potential Limitations
Advancing the above ideas to define "adjusted" local and overall deviance R 2 measures, similarly to the classical adjusted deviance R 2 for GLMs (Guisan & Zimmermann, 2000, p. 167), to compare models with alternative nested/nonnested sets of covariates and/or with different number of latent groups, would seem natural. Below, we try to explain why, in our opinion, such an exercise makes no sense in the context of mixtures of GLMs.

GLMs
Starting from Eq. 9, and similarly to the adjusted R 2 for the OLS regression, the adjusted deviance R 2 for GLMs is defined (Guisan & Zimmermann, 2000, p. 167) as where n − (d + 1), and n − 1 represent the so-called number of degrees of freedom of D μ, y; φ and D ȳ, y; φ , respectively.
Soft contribution of cluster j to the between deviance.
The primary attractiveness of R 2 is that it imposes a penalty for adding additional independent variables to the GLM. The second related attractiveness of R 2 is that it can be used to choose between nested/nonnested GLMs, with the aim of selecting the best set of explanatory variables (variable/model selection).

Mixtures of GLMs
The arguments of Section 5.1 can be easily extended to the local and overall deviance R 2 measures introduced in Section 4.3.
Adjusted local deviance R 2 . In the spirit of Eq. 32, the adjusted local deviance R 2 for mixtures of GLMs, in the generic j th group, could be defined as R 2 j implicitly assumes that the (soft) sample, of size n j , is defined by the posterior probabilities z ij , i = 1, . . . , n. According, for example, to Gujarati and Porter (2009), the number of degrees of freedom n j − (d + 1) and n j − 1 in R 2 j is defined as the sample size minus the number of estimated parameters. Applying this rule, Figs. 1 and 2 show the number of degrees of freedom for all the deviances involved in our paper. Under the assumption of fixed soft partition across all competing models, R 2 j would share the attractive properties of the adjusted R 2 in Eq. 32. Unfortunately, such an assumption is unrealistic. As said before, the mixture of GLMs needs to be re-estimated every time the set of explanatory variables and/or the number of groups change, and the estimated soft partition changes accordingly. Consequently, the comparison of R 2 j -values from different models makes no sense.
Adjusted overall deviance R 2 . By using the degrees of freedom of RWD and WD reported in Fig. 1, we can define the adjusted overall deviance R 2 as Unfortunately, as for its local counterpart, R 2 in Eq. 34 is not useful to choose between different mixtures of GLMs as the benchmark denominator WD in Eq. 34 changes every time a different model is fitted. This occurs because WD is only one of the components of TD (refer to Fig. 1). Whereby TD is the same for each fitted model, WD changes whenever the soft partition z ij does.
In summary, we can still use the local R 2 in Eq. 33, and the overall R 2 in Eq. 30, as descriptive measures of fit. Instead, we cannot use the adjusted local deviance R 2 in Eq. 33, as well as the adjusted overall R 2 in Eq. 34, in the variable/model selection step.

Simulation Study
This simulation study has the goal of (i) investigating the behavior of the proposed local and overall deviance R 2 measures in Eqs. 29 and 30, and (ii) assessing the role of the three terms in the decomposition of the deviance given in Eq. 26, under the exponential family distributions discussed in Section 3.3, namely, Gaussian, Poisson, and binomial distributions.
The simulation study considers the following set of conditions: (1) the class separation, under two levels: "small" and "large"; (2) the regression fit, under two levels: "poor" and "good"; and (3) the sample size n, under "small" size (100 units) and "large" size (1000 units). As for the class separation, the values "small" and "large" should not be considered in absolute terms, but simply conditioned to the regression fit factor. This results in a completely balanced design with 2 3 = 8 crossed simulation conditions, which are summarized in Table 3.
For each data condition, we generated 250 data sets. To simplify the graphical representations, we take into account a single continuous covariate X (d = 1), generated from a standard normal distribution. The data generating process (DGP) is a mixture of k = 2 GLMs where the distribution of the response variable is assumed to be either Gaussian (Section 6.1), Poisson (Section 6.2) or binomial (Section 6.3). For the sake of space, we provide more insights about the first analysis, while we give brief comments for the other two examples. In the next subsections, we detail the parameters of the DGP and discuss the obtained simulation results for each DGP.
We conduct the whole analysis within the R environment (R Core Team, 2020). To fit mixtures of generalized linear regressions, we exploit the flexmix() function of the Condition 1 "small" "poor" 100 Condition 2 "large" "poor" 100 Condition 3 "small" "good" 100 Condition 4 "large" "good" 100 Condition 5 "small" "poor" 1000 Condition 6 "large" "poor" 1000 Condition 7 "small" "good" 1000 Condition 8 "large" "good" 1000 flexmix package (Leisch, 2004 andLeisch, 2008b). This function implements the EM algorithm to find ML estimates of the parameters. As the focus of the paper is not on computational aspects, we have decided to initialize the EM algorithm using the true partition of the generated data. Nonetheless, it might be of interest for the reader to understand how our measures perform in the real-life situation where cluster labels are unknown. This is why, in one of the three simulation studies-namely, the for the conditional Gaussian DGP-we decided to add a comparison of different initialization strategies for the EM algorithm.
TRUE.DGP. The DGP is used to compute the posterior probabilities of cluster membership of the generated data (soft assignment). Then, the initial (hard) partition is obtained by means of the MAP criterion. TRUE.clusters. For simulated data the true cluster memberships are available, and can be used for initialization. This is our default strategy, the one used in all simulations. Random short-EM. This procedure, suggested by (Biernacki et al., 2003), consists in S short runs of the EM algorithm, each with H iterations, from different random positions. Each run of the EM algorithm is "short" because it is executed for a small number of iterations, without waiting for convergence. Then, the EM algorithm is run from the parameter vector providing the largest likelihood from these short runs of EM. We consider two values for the number of short runs (S ∈ {1, 10}), while we fix H = 5. This gives rise to two alternatives that we name RshortEM.1 (when S = 1) and RshortEM.10 (when S = 10). To implement this initialization strategy, we use the initFlexmix() function of the flexmix package by specifying the arguments init = list("tol.em") and nrep = 1 (for RshortEM.1) and nrep = 10 (for RshortEM.10). Partitional clustering. Partitional clustering algorithms classify observations into k (specified by the user) clusters trying to minimize an objective function. A preliminary definition of "cluster center" is required. The most popular algorithms in this family are the following. k-means. The idea of k-means clustering ( (Forgy, 1965) and (MacQueen, 1967)), which is the most commonly used partitional clustering algorithm, is to partition observations so that the within-cluster sum of squares is minimized; here, each cluster is represented by its center, corresponding to the mean of points assigned to that cluster. We carry out this EM-initialization strategy by performing a k-means cluster analysis 10 times (for stability sake), and subsequently using the partition from the solution with the lowest within-cluster sum of squares. For the purpose, we use the kmeans() function included in the stats package. k-medoids (PAM). In K-medoids clustering, each cluster is represented by one of the data points (called cluster medoid) such that the average dissimilarity between each cluster medoid and all the other members of its cluster is minimal. k-medoids clustering is a robust alternative to k-means clustering. The most common k-medoids clustering method is the partitioning around medoids algorithm (PAM; Kaufman & Rousseeuw, 1990). To implement the PAM-initialization strategy, we use the pam() function included in the cluster package (Maechler et al., 2019).
Both the methods are applied to the whole data (X, Y ).
Gaussian mixtures. The use of Gaussian mixtures to obtain the initial partition is a further alternative. To fit these models, the EM algorithm is commonly used; it, in turn, requires an initialization strategy. To implement the EM algorithm to fit Gaussian mixtures, we refer to the Mclust() function of the mclust package. The latter allows fitting parsimonious variants of Gaussian mixtures. The EM algorithm is initialized according to partitions obtained from agglomerative hierarchical clustering procedures. In the analyses herein, we only consider the unconstrained Gaussian mixture, which is abbreviated as "VVV" in the package. Also in this case, we run the Mclust() function on the whole data (X, Y ).
This comparison is intended to raise the potential user's awareness of how much the initialization stage can impact on the results.

Gaussian case
We report in Table 4 the parameters of the DGP for the Gaussian case, under each of the eight conditions in Table 3. The regression coefficients in clusters 1 and 2 are denoted as β 1 = (β 01 , β 11 ) and β 2 = (β 02 , β 12 ) , respectively. Sample data sets, under each simulation condition, are provided in Fig. 3. The different colors refer to the underlying true classification (blue for cluster 1 and red for cluster 2). Table 5 shows, for each simulation condition, the Monte Carlo averages and standard deviations, over the 250 replications, of the following quantities: adjusted Rand index (ARI; (Hubert & Arabie, 1985)), to evaluate the agreement between the true partition and the MAP  Table 3 Journal of Classification Table 3. Red triangles indicate first group membership, blue squares indicate second group membership partition provided by the fitted model, R 2 1 (the deviance R 2 in cluster 1), R 2 2 (the deviance R 2 in cluster 2), R 2 (the overall deviance R 2 ), the normalized weights D 1 /WD (in cluster 1) and D 2 /WD (in cluster 2) defined in Eq. 31, and the NBD measure of cluster separation on the y-axis we introduced in Eq. 27.

Fig. 3 Gaussian case. Examples of generated data for each simulation condition in
ARI and NBD can both be seen as cluster validation statistics. However, while the former works at an external level (using the external true partition as a benchmark), the latter works at an internal level (see, e.g., Kassambara, 2017, Chapter 13). The ARI values decrease as In other words, the larger the overlap, the greater the difference between true and estimated partitions, the lower the ARI value. The NBD values decrease as the separation, once the points are projected along the y-axis, increases. So, in this case, the larger the "vertical" overlap, the lower the NBD value.
Concerning the evaluation of the average local and overall deviance R 2 s, we recall that the two clusters have an intercept of opposite sign (β 01 = −β 02 ) regardless of the simulation condition, the same slope (β 11 = β 12 = β 1 ) for the simulation conditions 1, 2, 5, and 6, and slopes of opposite sign (β 11 = −β 12 ) for the remaining simulation conditions (see Table 4). However, the standard deviation in cluster 2 is always lower; this yields a generally larger local R 2 in cluster 2 as the regression line fits better the data in this cluster. Moreover, when the absolute value of the slope is larger (refer to the simulation conditions 3, 4, 7, and 8 in Table 3), the local deviance R 2 s are larger too-showing the improvement of the local model with respect to the local intercept-only GLMs.
The normalized weights depend on local deviances D 1 and D 2 ; so, the weight is large in the cluster where the difference between the saturated GLM and the intercept-only GLM is large too. The interpretation of the overall deviance R 2 arises naturally-it is a simple weighted average of the local deviance R 2 s (R 2 1 and R 2 2 ). Apart from the simulation conditions 1 and 5, we do not see any particular effect of the sample size on the obtained results. Finally, Fig. 4 gives a graphical representation of the normalized terms of the deviance decomposition in Eq. 4.1.
As said before, for this simulation case we also compare the results from the use of different initialization strategies. Tables 6-13 report the average values of R 2 1 , R 2 2 , R 2 , D 1 /WD, D 2 /WD, and NBD across the 250 replications for each of the considered simulation conditions in Table 3. Apart from the first two conditions, where the initialization strategy seems to play a role, there is an overall agreement between initialization strategies on all the considered indexes. As for the first two conditions, from Tables 6-7 we note that for a poor  Table 3 Journal of Classification          Table 3 regression fit in each cluster and a small sample size, the initialization strategy has an impact on the determined classification and, as a by-product, on the results of our measures. Moreover, we also note how initializing the EM algorithm with the TRUE.DGP strategy produces worse results.

Poisson case
We report in Table 14 the parameters of the DGP for the Poisson case, under each simulation condition in Table 3.
We give examples of generated data in Fig. 5. Also in this case, we use blue for cluster 1 and red for cluster 2. Table 15 shows the obtained results for each simulation condition. Under the odd simulation conditions, clusters are poorly separated, and this affects cluster recovery-showing relatively small ARI values. Under the simulation conditions 3, 4, 7, and 8, the NBD values are small because, once projected on the y-axis, the clusters overlap in such a way that the fitted meansȳ 1 andȳ 2 are close and, consequently, the difference between the interceptonly model and the local intercept-only model is small (see Fig. 5). The local deviance R 2 s are small when the local slope is close to zero (simulation conditions 1, 2, 5, and 6). This occurs because the fitted and local intercept-only models are somewhat similar. Figure 6 gives a graphical representation of the normalized terms of the deviance decomposition in Eq. 4.1. Also in this case, we do not see any particular effect of the sample size on the obtained results. Table 16 reports the parameters of the DGP for the binomial case, under each simulation condition in Table 3; the number of trials has been fixed to m = 10.

Binomial Case
In Fig. 7, we plot 8 sample data sets for each of the 8 simulation conditions in Table 3.   Table 3. Red triangles indicate first group membership, and blue squares indicate second group membership Table 17 shows the obtained results for each simulation condition. In this example, it is interesting to note the small values of the NBD term under the simulation conditions 3, 4, 7, and 8 (refer to Figs. 7(c), (d), (g), and (h)); in these cases, the clusters overlap a lot when the points are projected on the y-axis and this impacts on the NBD values. Moreover, the local slopes are the same in both clusters (β 11 = β 12 = β 1 ) regardless of the simulation condition: local and overall deviance R 2 s are small when β 1 = 0.01, and large when β 1 = 2.  Table 3 Figure 8 gives a graphical representation of the normalized terms of the deviance decomposition in Eq. 4.1. Also in this case, we do not observe any particular effect of the sample size on the obtained results.

Clusters of COVID-19 Cases in Italy Before and After Social Restrictions
The Coronavirus disease 2019-better known as COVID-19-is caused by the SARS-CoV-2 virus that has appeared in Wuhan (China) in late 2019. Quickly, the virus has spread within and outside China to the entire world. Italy has had 12462 confirmed cases according to the Istituto Superiore di Sanità as of March 11, 2020, and 827 deaths. Only China has recorded more deaths due to this COVID-19 outbreak. Why the spread has been so fast is a major concern. The average age of those who died in Italy was 81 years. More than twothirds of these patients had risk factors like diabetes, cardiovascular diseases, or cancer, or were former smokers (Remuzzi & Remuzzi, 2020). In addition to risk factor prevalences,  Table 3 Journal of Classification  Table 3. Red triangles indicate first group membership, blue squares indicate second group membership the peculiar shape of the Italian territory-very few uninhabited and isolated lands-may have facilitated the COVID-19's pattern of spread.
In this application, we analyze data from the Italian Civil Protection Department ("Dipartimento di Prodezione Civile" in Italian, DPC hereafter) that are publicly available 1 . The The aim of this real-data analysis is to find geographic clusters of COVID-19 prevalence, before and after the introduction of the two most severe social restrictions, respectively enforced in 2020, March 11 (lock-down of all commercial activities and retail sales, except for grocery and staple food stores, and pharmacies), and March 20 (lock-down of city parks)  Table 3 by the Italian Government. To do so, we specify a conditional Poisson mixture, in which the number of positives is regressed on each county's geographic coordinates-offsetting for the total population of each county. The analysis will be carried out separately for the two time points. On the March 11 data set, we select the number of components k ranging from 1 to 4. We choose not to go beyond 4 clusters to avoid overfitting solutions, due to the relatively small sample size (n = 106 counties), and to guarantee class interpretability. For each value of k, we implement the Random short-EM initialization strategy. To guarantee a stable solution, we decided to set the number of replicates to S = 100.
BIC values are 18379. 886, 3843.310, 2305.327, and 1371.599, for k = 1, 2, 3, 4, respectively. For model-based clustering with count data, entropy-based criteria for selecting the mixture order may be used as well. In our real-data example, ICL displays a similar trend to BIC-3845.215, 2309.169, and1394.763, respectively for k = 2, 3, 4. To illustrate our proposal, we will present local, and overall fit measures for all 4 models. For further analyses-regression tables, clustering results and comparison with March 26 data-we will focus on the minimum BIC solution. To avoid label switching and to enhance class interpretation, the estimated posteriors from March 11 data will be used to initialize flexmix() to fit March 26 data.  Table 18 presents class proportions, local, and overall fit measures for March 11 data. We observe an overall R 2 around 0.45 for the 1-class solution, and 0.82 for k = 2. In the 2-class solution, the two classes have very uneven sample sizes (0.169 and 0.831), with the smaller group having a better regression fit than the other one-taking values 0.864 and 0.768, respectively-and also a larger contribution to the overall within deviance.
A similar pattern emerges for k = 3: two small classes with proportions 0.130 and 0.190, and a larger one with proportion 0.680. The two smaller classes deliver a higher local R 2 , and a greater contribution to the overall within deviance than the largest one. Similarly, also the NBD increases.
In the case with four classes (k = 4), there are two large groups of relative sizes 0.520 and 0.262, and two smaller groups with proportions 0.092 and 0.126. Interestingly, we observe that the two smaller groups have a relatively greater local R 2 than the larger groups. Group 3 (withπ 3 = 0.126) has the most relevant contribution to the overall within deviance. Note also that NBD, with respect to the k = 3 case, reduces to 0.366 due to a greater overlap among the groups. Table 19 reports regression results for March 11 data. Both intercepts and regression coefficients differ significantly across group. The first (largest) group has the smallest (statistically significant) latitude coefficient, and a non-significant coefficient for longitude. Counties belonging to this group are spread all over Italy and have a relatively smaller number of positives as of March 11 ( Fig. 9) compared to counties belonging to the other groups.
The second class is the one with the smallest size, and has a relatively large and positive coefficient on latitude, and a negative one for longitude: counties belonging to this group are located in the north-western part of Italy and on average have a relatively larger number of positives as of March 11.
Group 3 is the second largest in terms of average number of positives-some of the northern counties where the pandemic struck the hardest in the first place. Note that some counties in the center, and south of the country are classified to this group as well, although with a smaller number of absolute positives than northern counties from the same group. This is due to the fact that the number of cases per inhabitant are anyway comparable, and the model accounts for this through the offset term.  Fig. 9 Cluster membership, predicted, and observed number of COVID-19-positive cases as of March 11, 2020 The fourth group has a similar effect of latitude on the response as in group 3, but a small and negative coefficient for longitude. We observe from the map (Fig. 9(c)) that groups 3 and 4 are constituted by counties located along highways. Yet, group 4 has an average smaller COVID-19 prevalence than group 3. Overall, the model seems to fit the data well in terms of observed ( Fig. 9(a)) against predicted ( Fig. 9(b)) number of positives. Interestingly, the prevalence of COVID-19 in northern counties is smoothed upward.  Table 20 displays local, overall fit measures, and regression results for March 26 data. The overall pattern is similar to March 22 data, still with noteworthy differences. The biggest class has a relatively lower size, with both local R 2 and D j /WD indicating a better local fit, and a larger contribution to the overall within variation-the class is now more homogeneous. By contrast, class two has grown in size, though with a relatively lower average number of positives ( Fig. 9(c)), and a poorer local regression fit.
On average, counties with largest number of positives relative to inhabitants belong to group 3. Group 4 is the second smallest average number of positives. Interestingly, from the regression output (Table 20) we observe that coefficients for latitude are all smaller than for March 11 data, indicating that the pandemic has spread all over the country (see also Fig. 10(a), (b), and (c)). All in all, it seems that the spread of the virus is relatively more even as of March 26, compared to March 11.

Conclusions
In this paper, we have proposed several deviance-based measures to evaluate the goodness/lack of fit for mixtures of GLMs, at both cluster and whole-sample level; this approach extends usual indices for a single GLM. Our aim was to gain further insights about the fitted model focusing on the behavior within and between clusters.
The proposals have been illustrated by means of a large simulation study that covered Gaussian, Poisson, and binomial responses as special cases of the exponential family distributions, and an application on COVID-19 Italian data. In the COVID-19 data application, we observed that, before and after the two social restriction measures enforced by the Italian government, the spread of the pandemic has become more evenly distributed along the country.
At a more general level, the proposed fit measures can be very useful in applied research with clusterwise regression analysis to assess model fit, as well as to validate the clustering solution(s). Regarding software implementation, a commented R script computing all goodness of fit quantities can be found as an online supplement to this manuscript.
It is important to remark that the use of these measures to perform model selection, namely, to compare models with a different nested/nonnested sets of covariates and/or a different number of latent groups, is completely deceptive. The reason being that different models deliver different soft groupings, and there is no direct link among groupings from different models. Therefore, even if we were able to adjust our measures for the degrees of freedom of each competing model, the resulting comparison would intrinsically not make sense. Such an issue was already noted by (Ingrassia & Punzo, 2020) in the case of R 2 measures for mixtures of linear models.
Although these deviance-based measures have been introduced exploiting the usual EM algorithm, their validity goes beyond it. Our measures also work when the parameters of the model are estimated by variants of the EM algorithm, such as the stochastic EM (SEM; Diebolt & Ip, 1996) or the classification EM (CEM; Celeux & Govaert, 1992). Such algorithms are well-known in the mixture modeling literature -for instance, their implementation is also available in the flexmix package (Grün & Leisch, 2008b).
For both variants, an additional step is added between the E-and M-steps, where the estimated a posteriori probabilities are used to assign each observation to only one component. For the SEM algorithm this assignment is determined in a stochastic way by randomly drawing memberships for each unit i from a multinomial distribution with probabilities z i1 , . . . , z ik , i = 1, . . . , n. By contrast, the assignment is deterministic for the CEM algorithm. Therefore, to extend our fit measures to these cases, we simply need to replace the soft cluster memberships defined in Section 3.2 with hard (crisp) 0/1 memberships.
Future work can focus on the extension of the results of this paper to mixtures of generalized nonlinear models, the family of models recently introduced by Omerovic (2019) in her Ph.D. thesis.
Funding Open access funding provided by Università degli Studi di Catania within the CRUI-CARE Agreement.

Data Availability
The real data set on COVID spread analyzed during the current study is public and can be downloaded from https://opendatamds.maps.arcgis.com/apps/dashboards/0f1c9a02467b45a7b4ca12d 8ba296596.

Consent to Participate
The manuscript does not contain any studies involving human or animal participants performed by any of the authors.

Conflict of Interest
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.